Abstract

A new fast algorithm for clustering and classification of large collections of text documents 
is introduced. The new algorithm employs the bipartite graph that realizes the word- 
document matrix of the collection. Namely, the modularity of the bipartite graph is used as 
the optimization functional. Experiments performed with the new algorithm on a number
of text collections showed a competitive quality of clustering (classification) and a
record-breaking speed.

Keywords: Text Clustering, Text Classification, Modularity 
1. Introduction 

We explore a possibility of clustering (or classification) of documents. Clustering and
classification are methods for information retrieval (for a recent review see (Berry, 2003)). The
possibility we explore consists in combining two ideas considered previously.

The first idea is co-clustering (Dhillon, 2001), (Zha et al., 2001). Co-clustering clusters
the words used in the documents along with the documents themselves. As an outcome, clusters of
documents are generated along with corresponding clusters of words. This approach has
the following advantages: clusters of words generated as a byproduct can
be used for interpretation of the clusters of documents; in classification tasks, it is
possible to use separate words along with documents in the training sets. The standard
algorithm used within the co-clustering approach is spectral clustering
(von Luxburg, 2007). (Computationally, spectral clustering finds eigenvectors of the graph
Laplacian. With a number of tricks, the eigenvectors are used for clustering.)

The second idea is modularity (Newman, 2006). Modularity is a class of optimization
functionals introduced in the studies of graph clustering. Let us compare modularity to
other optimization functionals appearing within the widely used approach to clustering
based on generative models (Zhong and Ghosh, 2005). These optimization functionals are
various "distances" between the data and the model. Optimization consists in finding
parameters of the model yielding the minimal distance. In contrast, the modularity is
optimal when a "distance" between the data and a null-model is maximal. The null-model
is a key notion for the modularity idea. The null-model models the data without structure
(the most random data). Concretely, modularity is defined as follows. A functional on
graph partitionings is picked out. Modularity is an additive or multiplicative difference of
the value the functional takes on the graph under study and the mean value it takes on
the null-model. From the above comparison we conclude that using modularity is in a way
less demanding than using generative models, because it is easier to model randomness
than specific data. A comparison of modularity-based approaches and generative-model
approaches is attempted in (Karrer and Newman, 2011), (Bickel and Chen, 2009).



Modularity has been used in text clustering (Grineva et al., 2009). In this attempt, a
dense weighted graph has been clustered. The nodes of the graph are the documents, all
the documents are potentially linked to one another, and the edges have weights characterizing
the similarity of the linked documents.

In this paper we apply the modularity of (Newman, 2006) to the bipartite word-document
graph. This is a very sparse bipartite graph G whose nodes are documents and words, and 
edges are between documents and words contained in them. The sparsity of G makes our 
approach practical. 

More technically, our work is based on two facts. First, the modularity can be optimized
with fast and efficient algorithms (Blondel et al., 2008) that have complexity
proportional to the number of links. (Here we point out that we independently developed
an algorithm similar to the so-called Louvain algorithm (Blondel et al., 2008) before the
paper (Blondel et al., 2008) appeared. We had used it in 2007 to cluster the citation graph
of the papers from http://arxiv.org. The results of this clustering are accessible via
http://xstructure.inr.ac.ru.) Second, the density of the graphs in our experiments
(see footnote 1) was in the range from 0.0015 to 0.0060. For such graphs,
$|E| \propto |V| \log |V|$, where $|E|$ is
the number of edges and $|V|$ is the number of vertexes of the graph. Also, the number of
vertexes in our graph approximately equals the number of documents in the collection.

We conclude that in the case under consideration the linearity of the algorithm in the
number of edges of the graph almost implies linearity in the number of documents. In
this way we obtain a very fast algorithm. It allows one to cluster (classify) tens of millions of
documents in a few hours on typical computer hardware. Presently, a clustering problem
is considered to be "large scale" if it involves up to $10^6$ documents (Vries and Geva, 2010).
With our algorithm, it is possible to raise this bar at least up to $10^7$ documents.

The paper is organised as follows. In the next section, we outline the algorithm. In the 
third section, we present the results of experiments applying the new algorithm to various 
text collections. In the concluding section, we briefly summarise our achievements.



2. The Algorithm 

2.1 Clustering: The Basic Algorithm 

In this section, we outline the algorithm we used to maximize the modularity of the bipartite
graph G modeling a collection of documents (its vertexes are documents and words of the
collection; an edge appears between a document and a word if the latter is contained in the
former).

1. The density of a graph is defined as $2|E|/(|V|(|V| - 1))$, where $|E|$ is the number of edges and $|V|$ is the
number of the vertexes of the graph.

The modularity $Q(P, G)$ is a functional defined on the set $\{P\}$ whose members are
partitions of the set of vertexes of the graph $G$ (Newman, 2006).

As discussed above, the modularity is the difference between the fraction of the edges inside
the clusters for the graph under consideration and for the null model. For example, for a
simple (unweighted and undirected) graph, the value it takes on a particular partition is

$$Q(P, G) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - \frac{D_i^2}{4L^2} \right), \qquad (1)$$

where the summation runs over the clusters of the partition, $N$ is the number of clusters,
$l_i$ is the number of edges inside the $i$th cluster, $L$ is the number of graph edges, and $D_i$ is
the sum of degrees of vertexes inside the cluster $i$.
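
As an illustration of Eq. (1), the following minimal Python sketch evaluates the modularity of a given partition of a simple graph. The function name `modularity`, the edge-list input format and the toy graph are assumptions made for this example; they are not taken from the implementation used in the experiments.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Eq. (1) for a simple undirected graph.

    edges      -- list of (u, v) pairs, each undirected edge listed once
    cluster_of -- dict mapping every vertex to its cluster label
    """
    L = len(edges)
    inside = defaultdict(int)   # l_i: edges with both ends in cluster i
    degree = defaultdict(int)   # D_i: sum of vertex degrees in cluster i
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / L - degree[c] ** 2 / (4 * L ** 2) for c in degree)


# Toy graph (illustrative): two triangles joined by one edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}

q_split = modularity(edges, two_triangles)   # 2 * (3/7 - 49/196) = 5/14
q_whole = modularity(edges, one_cluster)     # 7/7 - 196/196 = 0
```

Putting all vertexes into a single cluster always gives zero, which reflects the comparison with the null model built into the definition.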

Modularity can be used to determine an invariant of the graph $G$: the partition $P$ that
gives the modularity its maximal value. Generally, computing this invariant is an
NP-complete problem (Brandes et al., 2006). There are a number of algorithms for computing
an approximation to this invariant.

For our particular case, where the graph under consideration is a bipartite one, the null
model should be modified so that edges appear randomly only between the two
parts of the graph. Accordingly, equation (1) is transformed as follows (Barber, 2007):

$$Q_{bi}(P, G) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - \frac{D_i^1 D_i^2}{L^2} \right), \qquad (2)$$

where $D_i^1$ ($D_i^2$) is the sum of degrees of vertexes inside the first (second) part of the $i$th
cluster, and $L$ is the number of graph edges.
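
The bipartite variant of Eq. (2) can be sketched in the same way. The snippet below is only an illustration: the function name `bipartite_modularity`, the convention that every edge is a (document, word) pair, and the toy data are assumptions of this example.

```python
from collections import defaultdict


def bipartite_modularity(edges, doc_cluster, word_cluster):
    """Eq. (2) for a bipartite document-word graph.

    edges        -- list of (document, word) pairs, unit weights
    doc_cluster  -- dict: document -> cluster label
    word_cluster -- dict: word -> cluster label
    """
    L = len(edges)
    inside = defaultdict(int)    # l_i: edges inside cluster i
    d_docs = defaultdict(int)    # D_i^1: degrees of documents in cluster i
    d_words = defaultdict(int)   # D_i^2: degrees of words in cluster i
    for d, w in edges:
        cd, cw = doc_cluster[d], word_cluster[w]
        d_docs[cd] += 1
        d_words[cw] += 1
        if cd == cw:
            inside[cd] += 1
    clusters = set(d_docs) | set(d_words)
    return sum(inside[c] / L - d_docs[c] * d_words[c] / L ** 2 for c in clusters)


# Toy collection (illustrative): two documents about pets, one about fish.
edges = [("d1", "cat"), ("d1", "dog"), ("d2", "cat"), ("d2", "dog"), ("d3", "fish")]
doc_cluster = {"d1": "A", "d2": "A", "d3": "B"}
word_cluster = {"cat": "A", "dog": "A", "fish": "B"}

q_bi = bipartite_modularity(edges, doc_cluster, word_cluster)
# Cluster A: 4/5 - 4*4/25, cluster B: 1/5 - 1*1/25, total 8/25.
```

Note that the documents and the words receive cluster labels from the same label set, which is what makes the word clusters available as a byproduct.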

Our algorithm is based on the use of an operation $T_P$ to be defined below. It acts on
any partition $P'$ that can be obtained from the partition $P$ involved in its definition by a
coarsening, $P' \ge P$ (this means that the subsets of $P'$ can be obtained by merging some
subsets of $P$). The outcome of $T_P$ acting on $P'$ is a new partition whose modularity is not
less than that of $P'$: $Q(T_P P') \ge Q(P')$. (Here and below we omit the second argument
of $Q(P, G)$ because the graph $G$ is fixed.) This is the basic property of the operation $T_P$:
its action "improves" the partition. The construction of $T_P$ does not use any specific property
of the quality functional $Q$, and can be given for any particular choice of the latter. We
stress, however, that the resulting $T_P$ depends on the particular choice of the quality functional $Q$.

To define $T_P$, we introduce an arbitrary numbering of the elements $v$, $v \in P$ (the notation
$v$ originates from the most refined partition of $G$ whose members are separate vertexes).
After that, instead of the set of elements $v$ of the partition $P$ we deal with the set of their
numbers, $v \in \{1, 2, \ldots, |P|\}$.

The next step is to introduce coordinates on the set of $P' \ge P$. Each $P'$ can be
considered as a point in a space with $|P|$ discrete coordinates; each coordinate takes an
integer value from 1 to $|P|$. Indeed, each $P'$ defines an equivalence relation on the numbers:
$v' \sim v$ if $v$ and $v'$ belong to the same subset of $P'$. The $v$th coordinate of $P'$ can be defined
as follows:

$$P'_v = \max_{v' \sim v} v'. \qquad (3)$$

So, by this formula, any element $v$ is mapped to the element with the maximal number inside
the same cluster of $P'$. Inversely, any point $(x_1, \ldots, x_{|P|})$ of the discrete space
$\{1, \ldots, |P|\}^{|P|}$ can
be interpreted as a partition $P'$ whose members are obtained by merging the subsets of $P$
whose coordinates $x_v$ coincide.

Now the functional $Q$ can be considered as a function of $|P|$ discrete arguments:

$$Q(P') = Q(P'_1, \ldots, P'_{|P|}); \qquad (4)$$

each argument runs from 1 to $|P|$. We are looking for the maximum of this function.

To approximate the maximum, we can take any starting $P'$ and use the discrete cyclic
coordinate descent method (Luenberger) to obtain a point $T_P P'$ improving the
partition $P'$: $Q(T_P P') \ge Q(P')$. This concludes our definition of the operation $T_P$.
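
One possible reading of the operation $T_P$ is sketched below in Python. It is a deliberately naive illustration and makes several assumptions that are not in the text: the names `modularity` and `improve` are hypothetical, the quality functional is the simple-graph modularity of Eq. (1), coordinates are stored as arbitrary cluster labels rather than as the maximal numbers of Eq. (3), and the modularity is recomputed from scratch for every trial move, whereas a practical implementation would update it incrementally.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Eq. (1): sum over clusters of l_i/L - D_i^2/(4 L^2)."""
    L = len(edges)
    inside, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / L - degree[c] ** 2 / (4 * L ** 2) for c in degree)


def improve(edges, block_of, label_of_block):
    """Cyclic coordinate descent over the blocks of the base partition P.

    block_of       -- dict: vertex -> block of P (blocks are never split)
    label_of_block -- dict: block -> cluster label, i.e. the coarsening P'
    Returns a new block -> label dict whose modularity is not lower.
    """
    labels = dict(label_of_block)

    def quality(lbls):
        return modularity(edges, {v: lbls[b] for v, b in block_of.items()})

    best = quality(labels)
    changed = True
    while changed:                    # repeat sweeps until nothing moves
        changed = False
        for b in sorted(labels):      # one cyclic pass over the coordinates
            # Candidate values: any cluster in use, or a fresh singleton.
            candidates = set(labels.values()) | {("solo", b)}
            for cand in candidates:
                old = labels[b]
                if cand == old:
                    continue
                labels[b] = cand
                trial = quality(labels)
                if trial > best + 1e-12:   # accept strict improvements only
                    best, changed = trial, True
                else:
                    labels[b] = old
    return labels


# Toy run (illustrative): start from the most refined partition.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
block_of = {v: v for v in range(6)}
start = {b: b for b in range(6)}
result = improve(edges, block_of, start)

q_before = modularity(edges, {v: start[block_of[v]] for v in block_of})
q_after = modularity(edges, {v: result[block_of[v]] for v in block_of})
```

Because only strictly improving moves are accepted and the number of partitions is finite, the loop terminates, and the property $Q(T_P P') \ge Q(P')$ holds by construction.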

The operation $T_P$ can be used to describe the previously introduced Louvain algorithm
(Blondel et al., 2008). Indeed, the Louvain algorithm yields the partition ending the
sequence of partitions $P_n = T_{P_{n-1}} P_{n-1}$ that starts from the most refined partition $P_0$ whose
members are the vertexes.

Experimenting with classification of text collections, we have found that it is advantageous
to use another sequence of partitions approaching the maximum, $P_n = T_{P_0} T_{P_{n-1}} P_{n-1}$.
So, we start with the most refined partition $P_0$. The first step of the process yields
$P_1 = T_{P_0} P_0$ (this is the case because $T_P^2 = T_P$ for any $P$), the second, $P_2 = T_{P_0} T_{P_1} P_1$,
and so on. The process stops when its next step yields a partition whose modularity coincides
with the one obtained on the previous step.

Comparing this algorithm to the Louvain algorithm we point out that, in contrast to the
Louvain algorithm, each step of our algorithm does not necessarily coarsen the partition,
i.e. our $P_n$ is not always coarser than $P_{n-1}$. The results we obtain appear to be more
accurate (in the sense to be defined later on) than the ones obtained with the Louvain
algorithm.

This concludes the general description of our algorithm. 
2.2 Clustering: Finetuning 

Handmade classifications of large text collections have a number of classification levels.
For example, the online archive arxiv.org has three classification levels (e.g. Physics,
Condensed Matter, Superconductivity), and the huge collection of web sites dmoz.org has
more than three classification levels (the actual number of levels depends on the subject
field). Such levels are not described with the above approach employing the modularity
function.



A handle on this is provided by the parametric modularity introduced in (Reichardt and Bornholdt,
2006), (Lambiotte, 2010). It is defined as follows:

$$Q(P, G, \lambda) = \sum_{i=1}^{N} \left( \frac{l_i}{L} - \lambda \frac{D_i^1 D_i^2}{L^2} \right), \qquad (5)$$

where an extra real positive parameter $\lambda$ has appeared.

Let us give an example clarifying the meaning of the new parameter $\lambda$. Consider a
graph $G_K$ which consists of $K$ copies of the graph $G$. Let the modularity of $G$ reach its
maximum value on the partition $P_{\max}$. This $G_K$ gives a simple model of a graph with
two classification levels naturally present: the upper level $P_2$ has as its classes the separate
copies of $G$, while the ground level $P_1$ of the classification subdivides each copy of $G$ into
the subgraphs participating in $P_{\max}$. With this notation,
$Q(G_K, P_1) = Q_+(G, P_{\max}) - Q_-(G, P_{\max})/K$,
where $Q_+$ ($Q_-$) denotes the first (second) term in the right-hand side of
(1). Also, $Q(G_K, P_2) = 1 - 1/K$. Because $Q_\pm < 1$, at large $K$, $Q(G_K, P_2) > Q(G_K, P_1)$.
We conclude that in this case the modularity is unable to resolve the ground level of the
classification if the number of subclasses at the upper level $K$ is large enough (practically,
this takes place at $K \gtrsim 10$). We can speculate that there is a "resolution limit" beyond
which the modularity is unable to resolve the substructures in a graph. (For more on this,
see (Fortunato and Barthelemy, 2007).)



Now consider the performance of the parametric modularity on the above graph $G_K$.
In this example, the graph is not a bipartite one. So, we take as a parametric modularity
the quantity $Q(G, P, \lambda) = \sum_{i=1}^{N} \left( l_i/L - \lambda D_i^2/(4L^2) \right)$. Compare this formula with the above
definition of the modularity for nonbipartite graphs (1). For this case, take $\lambda = K$. We
have $Q(G_K, P_1, K) = Q(G, P_{\max})$, and $Q(G_K, P_2, K) = 0$. We conclude that taking $\lambda = K$
enables the parametric modularity to see precisely the ground level of the classification.
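
The identities of this example can be checked numerically. The sketch below builds $K$ copies of a small graph and evaluates the parametric modularity for nonbipartite graphs on both classification levels. The toy graph, the choice $K = 20$, and the names `parametric_modularity`, `ground` and `upper` are assumptions of this illustration; the partition used for the single copy is simply taken as given, since the identities hold for any partition of $G$.

```python
from collections import defaultdict


def parametric_modularity(edges, cluster_of, lam=1.0):
    """Sum over clusters of l_i/L - lam * D_i^2 / (4 L^2)."""
    L = len(edges)
    inside, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / L - lam * degree[c] ** 2 / (4 * L ** 2) for c in degree)


# One copy of G: two triangles joined by an edge, split into the triangles.
base_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
p_single = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

K = 20
big_edges = [((c, u), (c, v)) for c in range(K) for u, v in base_edges]
ground = {(c, v): (c, p_single[v]) for c in range(K) for v in p_single}  # P_1
upper = {(c, v): c for c in range(K) for v in p_single}                  # P_2

q_single = parametric_modularity(base_edges, p_single)     # Q(G, P) = 5/14
q_ground = parametric_modularity(big_edges, ground)        # Q_+ - Q_-/K
q_upper = parametric_modularity(big_edges, upper)          # 1 - 1/K
q_ground_k = parametric_modularity(big_edges, ground, K)   # equals Q(G, P)
q_upper_k = parametric_modularity(big_edges, upper, K)     # equals 0
```

With `lam = 1` the upper level scores higher than the ground level, while with `lam = K` the ordering is reversed, in agreement with the discussion above.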

The big question in using the parametric modularity is how to find the "good values"
of the parameter $\lambda$. As we have seen, $\lambda$ has the meaning of the number of clusters on the
upper level of clustering, and we normally do not know it beforehand. At this moment
we do not give any prescription for choosing $\lambda$. In what follows, we use the parametric
modularity to find our classifications. We always give the value of $\lambda$ with which one or
another classification has been obtained.

What we can state is that varying $\lambda$ is a useful tool. In our experiments, $\lambda$ was varied
from 1 to 300.

2.3 Clustering: Tidying up 

Applying the above clustering algorithm to various large graphs, we observed the appearance
of long tails in the distribution of cluster sizes (numbers of vertexes): typically, along
with a few large clusters, we obtain a large number of relatively small clusters. The
smaller the cluster, the harder it is to interpret. Also, it seems that the appearance of small
clusters is often caused by minor peculiarities in the data.

In the results we present below, the vertexes of the clusters belonging to the long tails
are redistributed among a few large clusters. In this section, we describe the procedure of
this redistribution of the "astray" vertexes.

The redistribution was obtained with an operation similar to the above $T_P$. This
operation, $R_N$, depends on a natural number $N$. It acts on any partition $P$ with the number of
clusters larger than $N$, $|P| > N$.

First, the redistribution operation $R_N$ orders the clusters of the partition $P$ by their
size. Next, all the vertexes not included in the first $N$ clusters are counted. Let the number
of these astray vertexes be $M$. A redistribution of the astray vertexes among the $N$ largest
clusters can be specified by the set of coordinates $(x_1, \ldots, x_M)$. The value taken by the
coordinate $x_k$ equals the number of the large cluster the $k$th astray vertex is redistributed
to.



As in the operation $T_P$, the optimal point in the space with the above coordinates is
determined by the modularity with the discrete cyclic coordinate descent method (Luenberger).
The only undetermined ingredient in the definition of the redistribution operation
$R_N$ is the starting point for the descent.

The starting point for the descent was determined with the following procedure. The
value of the first coordinate $x_1$ was determined by the optimal number of the large cluster
for placing the first astray vertex in, under the condition that the rest of the astray vertexes
are considered as separate clusters. The value of the second coordinate $x_2$ was determined
similarly but under the condition that the first of the astray vertexes is already placed into the
large cluster number $x_1$, and so on.
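
The construction of the starting point can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: the name `greedy_start` is hypothetical, the simple-graph modularity of Eq. (1) is used as the quality functional, ties between clusters of equal size are broken by order of appearance, and the modularity is recomputed from scratch for each trial placement.

```python
from collections import Counter, defaultdict


def modularity(edges, cluster_of):
    """Eq. (1): sum over clusters of l_i/L - D_i^2/(4 L^2)."""
    L = len(edges)
    inside, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / L - degree[c] ** 2 / (4 * L ** 2) for c in degree)


def greedy_start(edges, cluster_of, n_keep):
    """Place astray vertexes one by one into the n_keep largest clusters."""
    sizes = Counter(cluster_of.values())
    large = [c for c, _ in sizes.most_common(n_keep)]
    keep = set(large)
    current, astray = {}, []
    for v, c in cluster_of.items():
        if c in keep:
            current[v] = c
        else:
            current[v] = ("astray", v)   # not yet placed: its own cluster
            astray.append(v)
    for v in astray:
        # Best large cluster for v, given the vertexes placed so far.
        current[v] = max(large, key=lambda c: modularity(edges, {**current, v: c}))
    return current


# Toy run (illustrative): vertex 6 hangs off the second triangle.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3), (5, 6)]
partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b", 6: "c"}
start_point = greedy_start(edges, partition, n_keep=2)
```

In the toy run the single astray vertex joins the cluster of its only neighbour. The resulting assignment would then serve as the initial point for the coordinate descent.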

Previously we described the sequence of partitions $P_n = T_{P_0} T_{P_{n-1}} P_{n-1}$. It stops on a
partition $P$. Our final result is $P_f = T_{P_0} R_N P$, where $N < |P|$ is the number of clusters we
choose to be present in the final clustering. As before, the leftmost operation $T_{P_0}$ improves
the clustering (its action determines the optimal cluster for each vertex among the clusters
obtained by the action of the redistribution operation $R_N$).



2.4 Classification 

A classification problem is given if a subset of the classification indexes is already given (the
training set), and the rest should be generated. To clarify, the number of classes is preset
to $N$. For a subset of vertexes (the training set) the correct classes are known. For the
rest of the vertexes (the testing set) the correct classes should be determined. We attempt to solve
the classification problem using its analogy to the problem of redistribution of the astray
vertexes of the previous subsection.

To solve the problem using modularity, we point out that the correct classification defines
a partition on the set of documents obtained from joining the training and testing sets. The
members of this partition are the classes consisting of the documents of the training set
with the addition of the correctly attributed documents from the testing set. We assume that
this partition is the one that maximizes the parameterized modularity at a certain value of
the parameter $\lambda$. If $\lambda$ is known, this is a problem of maximization with constraints. The
constraints fix the number of clusters to $N$ and the distribution among the clusters for the
training set.

We look for an approximate solution of this problem using the above redistribution
operation $R_N$. Our approximation to the optimal classification is $P_c = R_N P$, where $P$ is
the partition with the training set correctly distributed and each of the remaining vertexes
belonging to its own cluster.



3. The Experiment 

Four document collections have been used for testing our algorithm. Three of them are
among well-known classical test collections: 20 Newsgroups, Reuters-21578, and WebKB4.
We used pre-processed versions of these collections (Cardoso-Cachopo). The fourth collection
(TripAdvisor dataset) is a collection of travelers' reviews of the hotels they stayed
in, obtained via the popular resource tripadvisor.com (OpinionAnalysisCorpus). In this
collection, all the reviews were classified into two classes: the positive and negative reviews.

Table 1 gives parameters of the collections. All four collections were used for clustering
and classification.



| Dataset | Total # of docs | # of training docs | # of test docs | # of classes |
|---|---|---|---|---|
| 20 Newsgroups | 18821 | 11293 | 7528 | 20 |
| Reuters-21578 | 7674 | 5485 | 2189 | 8 |
| WebKB4 | 4199 | 2803 | 1396 | 4 |
| TripAdvisor | 3000 | 1800 | 1200 | 2 |

Table 1: Parameters of the text collections



The performance has been measured with the standard quality functionals. For clustering,
the performance has been measured with the Purity (Manning et al., 2008) and
Normalized Mutual Information (NMI) (Manning et al., 2008) (see below). For classification,
it has been measured with micro and macro F1-measures (Manning et al., 2008) (see
below).

The bipartite graphs have been formed using stemming, removing stop-words (the stop-list
included 770 words) and rare words occurring in fewer than five documents. Besides the
graphs representing the document-word pairs, we also constructed larger graphs representing
the document-word and document-bigram pairs (a bigram is a sequence of two words
occurring in a document).
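
The construction of the document-word graph can be sketched as follows. The sketch is illustrative only: the function name `build_bipartite_edges`, whitespace tokenization and lower-casing are assumptions of this example, stemming is omitted, and the toy call uses a threshold of two documents instead of five so that the small example is not empty.

```python
from collections import Counter


def build_bipartite_edges(docs, stop_words, min_docs=5):
    """Return unit-weight (document, word) edges of the bipartite graph.

    docs       -- list of raw text strings
    stop_words -- set of words to drop
    min_docs   -- keep only words occurring in at least this many documents
    """
    # One set of distinct words per document (no stemming in this sketch).
    tokenized = [
        {w for w in text.lower().split() if w not in stop_words} for text in docs
    ]
    # Document frequency of every remaining word.
    doc_freq = Counter(w for words in tokenized for w in words)
    edges = []
    for i, words in enumerate(tokenized):
        for w in sorted(words):
            if doc_freq[w] >= min_docs:
                edges.append((("doc", i), ("word", w)))
    return edges


docs = ["the cat sat", "the cat ran", "a dog ran"]
edges = build_bipartite_edges(docs, stop_words={"the", "a"}, min_docs=2)
```

Tagging the vertexes as `("doc", i)` and `("word", w)` keeps the two parts of the graph apart even if a document identifier happens to coincide with a word.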

Table 2 gives parameters of the obtained graphs. 



| Dataset | # of vertexes, G1 | # of links, G1 | # of vertexes, G2 | # of links, G2 |
|---|---|---|---|---|
| 20 Newsgroups | 43000 | 1000300 | 131000 | 2000020 |
| Reuters-21578 | 13000 | 255000 | 25000 | 424000 |
| WebKB4 | 9500 | 275000 | 27000 | 500000 |
| TripAdvisor | 5400 | 150000 | 11200 | 193000 |

Table 2: Parameters of the bipartite graphs (G1 is the document-word graph, G2 is the
graph with bigrams included)



We used unit weights for the links in the graphs. (We also experimented with weighted links,
testing the standard tf-idf weights and weights generated via normalization by the document
length in the $L_2$-norm; this did not show improvement sufficient to justify the trouble
of using them.)

3.1 Experiment: Clustering 

The clustering was performed by the following protocol. For each testing collection,
optimization of the parameterized modularity was performed for a sequence of values
$\lambda = 1, 1.5, 2, \ldots$ with the objective of finding the suboptimal value of $\lambda$.

As mentioned above, the quality of clustering was measured with the Normalized Mutual
Information (NMI) and Purity functionals. These functionals are maximal when the
generated clustering coincides with a given "correct" clustering. Below we give the formulas
for computing these functionals. The clusters of the given "correct" clustering are called
classes. The NMI functional reads

$$\mathrm{NMI} = \frac{\sum_{l,m} N_{lm} \log\left( \frac{N N_{lm}}{N_l N_m} \right)}{\sqrt{\left( \sum_m N_m \log(N_m/N) \right) \left( \sum_l N_l \log(N_l/N) \right)}}. \qquad (6)$$

Here summation in $l$ is over the classes, in $m$ over the generated clusters, $N$ is the total
number of documents, $N_l$ ($N_m$) is the number of documents in class $l$ (cluster $m$), $N_{lm}$ is
the number of documents in the overlap between class $l$ and cluster $m$. The NMI takes its
values in the interval (0, 1), and measures a similarity between the generated clustering and
the known partitioning into classes.
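
A minimal Python sketch of the NMI computation is given below. It assumes the normalization by the geometric mean of the two entropy-like sums, which is one common convention; the function name `nmi`, the label-list input format and the convention of returning zero when either side has a single group are assumptions of this example.

```python
import math
from collections import Counter


def nmi(classes, clusters):
    """Normalized mutual information between two labelings of the same items.

    classes  -- list of "correct" class labels, one per document
    clusters -- list of generated cluster labels, one per document
    """
    n = len(classes)
    n_l = Counter(classes)                  # N_l: class sizes
    n_m = Counter(clusters)                 # N_m: cluster sizes
    n_lm = Counter(zip(classes, clusters))  # N_lm: overlaps
    mutual = sum(
        c * math.log(n * c / (n_l[l] * n_m[m])) for (l, m), c in n_lm.items()
    )
    h_l = sum(c * math.log(c / n) for c in n_l.values())   # non-positive
    h_m = sum(c * math.log(c / n) for c in n_m.values())   # non-positive
    if h_l == 0 or h_m == 0:
        return 0.0   # assumed convention for a single class or cluster
    return mutual / math.sqrt(h_l * h_m)


same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])    # clustering equals classes
unrelated = nmi([0, 0, 1, 1], ["x", "y", "x", "y"])  # independent labelings
```

The value does not depend on how the clusters are named, only on how the documents are grouped.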

For completeness, and to facilitate comparison with other algorithms, we also computed
a similar quality criterion, the Purity:

$$\mathrm{Purity} = \frac{1}{N} \sum_m \max_l N_{lm}. \qquad (7)$$
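
The Purity can be computed along the same lines. In the sketch below, each generated cluster is credited with the size of its largest overlap with a class; the function name `purity` and the label-list input format are assumptions of this illustration.

```python
from collections import Counter


def purity(classes, clusters):
    """Fraction of documents that fall into the dominant class of their cluster."""
    overlap = Counter(zip(clusters, classes))   # N_lm, keyed as (cluster, class)
    best = {}
    for (m, _l), count in overlap.items():
        best[m] = max(best.get(m, 0), count)    # max over classes for cluster m
    return sum(best.values()) / len(classes)


# Cluster 0 holds two documents of class 0 and one of class 1;
# cluster 1 holds two documents of class 1.
value = purity([0, 0, 1, 1, 1], [0, 0, 0, 1, 1])   # (2 + 2) / 5
```

Note that the Purity reaches 1 trivially when every document forms its own cluster, which is one reason to report it together with the NMI.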



Table 3 gives the clustering results. It shows that the optimization in the value of $\lambda$ and
the use of bigrams improve the quality of clustering (measured with NMI) considerably.



| Dataset | $\lambda$ | G1: # of clusters | G1: NMI | G1: Purity | G2: # of clusters | G2: NMI | G2: Purity |
|---|---|---|---|---|---|---|---|
| 20 Newsgroups | 1 | 9 | 0.58 | 0.38 | 18 | 0.59 | 0.43 |
| | 2.5 | 93 | 0.52 | 0.62 | 118 | 0.60 | 0.68 |
| | 2.5 | 20 | 0.59 | 0.61 | 20 | 0.63 | 0.67 |
| Reuters-21578 | 1 | 6 | 0.56 | 0.80 | 5 | 0.63 | 0.84 |
| WebKB4 | 1 | 11 | 0.35 | 0.70 | 14 | 0.34 | 0.73 |
| | 1 | 4 | 0.37 | 0.67 | 4 | 0.41 | 0.70 |
| | 1.7 | 12 | 0.35 | 0.68 | 38 | 0.37 | 0.7 |
| | 1.7 | 4 | 0.37 | 0.67 | 4 | 0.46 | 0.76 |
| TripAdvisor | 1 | 3 | 0.35 | 0.80 | 3 | 0.36 | 0.81 |
| | 1 | 2 | 0.59 | 0.92 | 2 | 0.52 | 0.89 |

Table 3: Clustering Results. The G1 and G2 columns give respectively results obtained
with the document-word graph and with the graph involving bigrams. $\lambda$ is the
modularity parameter. Numbers in italic were obtained with the projection onto
the first $K$ clusters ordered by their size ($K$ equals the number of clusters in the
training set). Numbers in bold give our best results.



Table 4 compares our results with the results obtained with other algorithms. The latter 
were extracted from sources pointed out in the Table 4 caption. 

3.2 Experiment: Classification 

In the classification experiments, we used the suboptimal value of the parameter $\lambda$
determined previously in the clustering experiments.
| Dataset / Algorithm | Modularity | ExtPLSA | MMF | SC | SKM | CLGR | NMF |
|---|---|---|---|---|---|---|---|
| 20 Newsgroups | 0.63 | 0.54 | 0.61 | 0.46 | | | |
| WebKB4 | 0.46 | 0.36 | | 0.45 | 0.43 | 0.54 | 0.45 |

Table 4: NMI values obtained with various methods. Columns are marked with the method
names. Modularity is the method of this paper; ExtPLSA is a version of the
probabilistic latent semantic analysis (Kim et al., 2008); MMF is a mixture of the
von Mises-Fisher distributions (Zhong and Ghosh, 2005); SC is the spectral clustering
(Zhong and Ghosh, 2005); SKM is the spherical K-means (Wang et al., 2007);
CLGR is Clustering with Local and Global Regularization (Wang et al., 2007);
NMF is the nonnegative matrix factorization (Wang et al., 2007).



There are two standard classification quality measures (Manning et al., 2008),
micro-averaged and macro-averaged:

$$\text{micro-F1} = \frac{\sum_c TP(c)}{D}, \qquad \text{macro-F1} = \frac{\sum_c F(c)}{N}. \qquad (8)$$

Here the sums in the right-hand sides of the definitions run over classes; $D$ is the number
of documents to be classified, $N$ is the number of classes, $TP(c)$ is the number of
correctly classified documents for class $c$, and $F(c) = 2R(c)P(c)/(R(c) + P(c))$, where
$R(c) = TP(c)/N_1(c)$, and $P(c) = TP(c)/N_2(c)$. In the last relations, $N_1(c)$ ($N_2(c)$) is,
respectively, the correct (actual) number of the documents from the testing set to be (have
been) attributed to class $c$.
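
The two measures can be sketched in Python as follows. The function name `f1_scores` and the label-list input format are assumptions of this example; the set of classes is taken from the correct labels, and a class with no correctly classified documents is given $F(c) = 0$, which is a convention chosen here to avoid division by zero.

```python
from collections import Counter


def f1_scores(true_labels, predicted_labels):
    """Return (micro-F1, macro-F1) for single-label classification."""
    classes = sorted(set(true_labels))
    tp = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    n1 = Counter(true_labels)        # N_1(c): documents that belong to class c
    n2 = Counter(predicted_labels)   # N_2(c): documents attributed to class c
    micro = sum(tp.values()) / len(true_labels)
    f_values = []
    for c in classes:
        if tp[c] == 0:
            f_values.append(0.0)     # assumed convention when TP(c) = 0
            continue
        recall = tp[c] / n1[c]
        precision = tp[c] / n2[c]
        f_values.append(2 * recall * precision / (recall + precision))
    macro = sum(f_values) / len(classes)
    return micro, macro


micro, macro = f1_scores([0, 0, 1, 1], [0, 1, 1, 1])
# Class 0: R = 1/2, P = 1, F = 2/3; class 1: R = 1, P = 2/3, F = 4/5.
```

When every document receives exactly one class, the micro-averaged value coincides with the plain accuracy, while the macro-averaged value gives equal weight to small and large classes.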

Table 5 gives results of our classification experiments. The same table compares our 
results to the results obtained with other algorithms. 



| Dataset | Modularity G1 mic | Modularity G1 mac | Modularity G2 mic | Modularity G2 mac | SVM mic | SVM mac | N-Bayes mic | N-Bayes mac | K-NN mic | K-NN mac |
|---|---|---|---|---|---|---|---|---|---|---|
| 20 Newsgroups | 78.70 | 77.12 | 82.19 | 82.78 | 82.84 | 83.60 | 81.03 | | 84.23 | 79.07 |
| Reuters-21578 | 91.23 | 76.25 | 92.77 | 81.19 | 96.98 | 91.50 | 96.07 | | 85.24 | 83.2 |
| WebKB4 | 80.66 | 78.92 | 85.24 | 84.74 | 89.68 | 88.39 | 83.52 | | 72.56 | |
| TripAdvisor | 90.30 | 90.10 | 85.60 | 85.60 | | | | | | |

Table 5: The columns Modularity G1 and Modularity G2 give the micro- and macro-F1
values obtained with the algorithms of this paper. The rest of the columns
list the values obtained with various methods: SVM with support vector machine
(Cardoso-Cachopo, 2007), (Guo et al., 2004); N-Bayes with the naive Bayes
(Guo et al., 2004), (Cardoso-Cachopo, 2007); K-NN with the K nearest neighbors
method (Cardoso-Cachopo, 2007), (Guo et al., 2004).






Pivovarov and Trunov



4. Conclusions 

We presented a new algorithm for clustering and classification of text collections. Our
algorithm optimizes modularity computed for a fundamental object, the word-document
bipartite graph.

At a competitive quality of the output, our algorithm's main advantage is its speed. Using the
results on the clustering of a large web graph (about one billion edges) (Blondel et al.,
2008), we estimate the time needed for clustering a collection of 10 million
documents (each document about the average size of the documents from the 20 Newsgroups
collection) as several hours on typical hardware.

We conclude that our algorithm can be used for clustering very large document collections
in reasonable time. With our algorithm, the size of amenable collections can be
increased by at least an order of magnitude.

We believe that using our algorithm opens up new possibilities for automated structuring
of the enormous number of text documents available via the web.